The 2009 Knowledge Discovery in Data Competition (KDD Cup 2009) Challenges in Machine Learning

Authors

  • Gideon Dror
  • Marc Boullé
  • Isabelle Guyon
  • Vincent Lemaire
  • David Vogel

Abstract

We organized the KDD Cup 2009 around a marketing problem with the goal of identifying data mining techniques capable of rapidly building predictive models and scoring new entries on a large database. Customer Relationship Management (CRM) is a key element of modern marketing strategies. The KDD Cup 2009 offered the opportunity to work on large marketing databases from the French telecom company Orange to predict the propensity of customers to switch provider (churn), buy new products or services (appetency), or buy upgrades or add-ons proposed to them to make the sale more profitable (up-selling). The challenge started on March 10, 2009 and ended on May 11, 2009. It attracted over 450 participants from 46 countries. We attribute the popularity of the challenge to several factors: (1) a generic problem relevant to industry (a classification problem) that nonetheless presents a number of scientific and technical challenges of practical interest, including a large number of training examples (50,000) with a large number of missing values (about 60%) and a large number of features (15,000), unbalanced class proportions (fewer than 10% of the examples in the positive class), noisy data, and categorical variables with many different values; (2) prizes (Orange offered 10,000 Euros in prizes); (3) a well-designed protocol and web site (we benefited from past experience); (4) an effective advertising campaign using mailings and a teleconference to answer potential participants' questions. The results of the challenge were discussed at the KDD conference (June 28, 2009). The principal conclusions are that ensemble methods are very effective and that ensembles of decision trees offer off-the-shelf solutions to problems with large numbers of samples and attributes, mixed types of variables, and many missing values. The data and the platform of the challenge remain available for research and educational purposes at http://www.kddcup-orange.com/.
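
The abstract's main conclusion, that ensembles of decision trees cope well with missing values and unbalanced classes, can be illustrated with a small sketch. Everything below is an assumption made for illustration and is not the organizers' or any participant's method: the synthetic data generator, the one-split "stump" trees with a dedicated branch for missing values, the random feature subsets, and the use of AUC as the evaluation metric.

```python
import random

def make_data(n, seed):
    """Synthetic rows (assumption): 3 numeric features, ~60% missing, ~8% positives."""
    rng = random.Random(seed)
    rows, labels = [], []
    for _ in range(n):
        signal = rng.random()
        labels.append(1 if signal > 0.92 else 0)
        x = [signal + rng.gauss(0, 0.1), rng.random(), signal + rng.gauss(0, 0.3)]
        rows.append([v if rng.random() > 0.6 else None for v in x])
    return rows, labels

def branch(value, threshold):
    """Route a value to one of three branches; missing values get their own branch."""
    if value is None:
        return "missing"
    return "low" if value <= threshold else "high"

def fit_stump(rows, labels, features):
    """Pick the (feature, threshold) with the lowest squared error over three branches."""
    base = sum(labels) / len(labels)
    best = None
    for j in features:
        observed = sorted(r[j] for r in rows if r[j] is not None)
        if not observed:
            continue
        for q in (0.25, 0.5, 0.75, 0.9):  # a few quantiles as candidate thresholds
            t = observed[int(q * (len(observed) - 1))]
            groups = {"missing": [], "low": [], "high": []}
            for r, y in zip(rows, labels):
                groups[branch(r[j], t)].append(y)
            means = {k: (sum(v) / len(v) if v else base) for k, v in groups.items()}
            sse = sum((y - means[k]) ** 2 for k, v in groups.items() for y in v)
            if best is None or sse < best[0]:
                best = (sse, j, t, means)
    if best is None:  # every candidate feature was entirely missing
        return features[0], 0.0, {"missing": base, "low": base, "high": base}
    return best[1:]

def fit_ensemble(rows, labels, n_estimators=25, seed=0):
    """Bagging: each stump sees a bootstrap sample and a random pair of features."""
    rng = random.Random(seed)
    n, d = len(rows), len(rows[0])
    stumps = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]
        feats = rng.sample(range(d), 2)
        stumps.append(fit_stump([rows[i] for i in idx], [labels[i] for i in idx], feats))
    return stumps

def predict_score(stumps, row):
    """Average the per-branch positive rates predicted by all stumps."""
    return sum(means[branch(row[j], t)] for j, t, means in stumps) / len(stumps)

def auc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

train_rows, train_labels = make_data(2000, seed=1)
test_rows, test_labels = make_data(1000, seed=2)
model = fit_ensemble(train_rows, train_labels)
test_scores = [predict_score(model, r) for r in test_rows]
test_auc = auc(test_scores, test_labels)
print("test AUC:", round(test_auc, 3))
```

The design choice worth noting is the third branch: rather than imputing a value, each stump learns a separate positive rate for rows where the feature is missing, which is one simple way tree-based models can tolerate heavy missingness.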

Similar resources

Predicting customer behaviour: The University of Melbourne's KDD Cup report

We discuss the challenges of the 2009 KDD Cup along with our ideas and methodologies for modelling the problem. The main stages included aggressive nonparametric feature selection, careful treatment of categorical variables and tuning a gradient boosting machine under Bernoulli loss with trees.

Bennett Netflix 100 Winchester Circle

INTRODUCTION The KDD Cup is the oldest of the many data mining competitions that are now popular [1]. It is an integral part of the annual ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD). In 2007, the traditional KDD Cup competition was augmented with a workshop with a focus on the concurrently active Netflix Prize competition [2]. The KDD Cup itself in 2007 con...

Introduction to Domain Driven Data Mining

Mainstream data mining faces critical challenges and lacks the soft power to solve real-world complex problems when deployed. Following the paradigm shift from ‘data mining’ to ‘knowledge discovery’, we believe much more thorough efforts are essential for promoting the wide acceptance and employment of knowledge discovery in real-world smart decision making. To this end, we expect a new pa...

Application of Additive Groves Ensemble with Multiple Counts Feature Evaluation to KDD Cup'09 Small Data Set

This paper describes a field trial of a recently developed ensemble called Additive Groves in the KDD Cup'09 competition. Additive Groves were applied to the three tasks of the competition using the "small" data set. On one of the three tasks, appetency, we achieved the best result among participants who similarly worked with the small data set only. Postcompetition analysis showed that less s...

Proceedings of the Third International Workshop on Knowledge Discovery from Sensor Data, Paris, France, June 28, 2009

Climate modeling and analysis of climate change have largely been based on forward simulation with physical models. We propose here a data centric approach to climate study based solely on the actual observed data. This novel approach utilizes a variety of relevant statistical modeling and machine learning techniques such as spatial-temporal causal modeling and extreme value modeling, and sugge...

Publication date: 2011